Making Indian Language Legacy Documents Accessible Via Web

Authors

  • Abhishek Kashyap
  • Sanjeev Kohli
  • Santanu Chaudhury
  • S. D. Joshi
Abstract

Reliable optical character recognition is not available for the scripts of Indian languages. Thus, the only way to make legacy documents in Indian languages available on the web is to scan them. This work addresses the need for a better representation and an efficient storage technique for Indian language documents, and for their near-perfect regeneration in the browser. We work with segments (corresponding to text, images, or white space) extracted from the original document page. To compress the segments separately, we use a shape-adaptive wavelet-based coding scheme, run-length encoding, and arithmetic bit-plane coding. An XML scheme represents the document page, and the data is stored on a server. A plug-in decodes the encoded data coming from the server and displays the document page in the web browser, thereby making the document pages web accessible.

Keywords: document image analysis, shape adaptive compression, entropy based quantization, eBooks
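Of the three coding schemes named in the abstract, run-length encoding is the simplest to illustrate. The following is a minimal sketch only, assuming a bilevel (0/1) scanline taken from a segment. The abstract does not say which segment type each coder is applied to or what the stored format looks like, so the function names `rle_encode` and `rle_decode` and the (value, length) pair layout are hypothetical and are not the authors' implementation.

```python
from typing import List, Tuple


def rle_encode(row: List[int]) -> List[Tuple[int, int]]:
    """Collapse a scanline into (pixel value, run length) pairs."""
    runs: List[Tuple[int, int]] = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            # Same value as the current run: extend it by one pixel.
            runs[-1] = (pixel, runs[-1][1] + 1)
        else:
            # Value changed (or first pixel): start a new run.
            runs.append((pixel, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (pixel value, run length) pairs back into a scanline."""
    row: List[int] = []
    for value, length in runs:
        row.extend([value] * length)
    return row
```

Run-length encoding pays off on rows with long uniform stretches, such as the white space between text lines, because each stretch collapses to a single pair regardless of its length.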

Similar resources

Making Legacy Data Accessible for Xml Applications

This paper presents design and implementation of DB2XML, a tool for transforming data from relational databases into XML documents. Document type declarations (DTDs) are generated describing the characteristics of the data making the documents self contained and usable as a data exchange format. DB2XML is written in Java and accesses databases through JDBC drivers. It can be used as a standalon...

Devising Interactive Access Techniques for Indian Language Document Images

A large volume of legacy documents in Indian languages exist only in paper form. Web based interactive access techniques for images of these documents can ensure wider dissemination and easy availability. In this paper, we have proposed an access mechanism based on word based indexing and personalized annotation. The word based indexing scheme exploits typical structural characteristics of Indi...

Reverse Engineering Interaction Plans for Legacy Interface Migration

Legacy interface migration is becoming an increasingly important IT activity; many organizations are interested in cost effective and low risk processes for making their legacy systems accessible to new, webbased platforms. Most migration techniques proposed to date require a lot of human expertise. In this paper we discuss Mathaino, an intelligent, multi platform, semi-automated, and low risk ...

Retrieval of Legal Documents: Combining Structured and Unstructured Information

Legal information is often accessible via portal web sites. Legal documents typically combine structured and unstructured information, the former being tagged with markup languages such as XML (Extensible Markup Language). Current information retrieval research takes into account the structured information content of documents when computing the relevance ranking. Such an approach is very promi...

FRBR-ML: A FRBR-based framework for semantic interoperability

Metadata related to cultural items such as literature, music and movies is a valuable resource that is currently exploited in many applications and services based on semantic web technologies. A vast amount of such information has been created by memory institutions in the last decades using different standard or ad hoc schemas, and a main challenge is to make this legacy data accessible as reu...

Publication date: 2001